My data is from GSE84054 (Goh et al. 2017)
What I have changed in A1: I decided to leave the genes that have duplicated names because Ruth sugguested that they are a very small number compare to my total number of genes, ~1%. So I deleted many steps in A1, and made the cleaning step clearer by showing boxplots, density plots and MDS plots in each cleaning step.
Loaded my normalized data from A1.
normalized_count_data <- read.table(file="GSE84054_normalized_count.txt")
kable(normalized_count_data[1:5, 1:5], type="html")
We need to explore the data again after normalization to ensure the normalized data reaches our expectations. 1. boxplot - normalized
# After normalization
data2plot_after <- log2(normalized_count_data[,3:ncol(normalized_count_data)])
{boxplot(data2plot_after, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Filtered and normalized RNASeq Samples")
abline(h = median(apply(data2plot_after, 2, median)), col = "green", lwd = 0.6, lty = "dashed")}
counts_density <- apply(log2(normalized_count_data[,3:ncol(normalized_count_data)]), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM",cex.lab = 0.85,
main = "Filtered and normalized RNASeq Samples distribution")
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(data2plot), col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4", merge = TRUE, bg = "gray90")
heatmap_matrix <- normalized_count_data[,3:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GENEID
colnames(heatmap_matrix) <- rownames(samples_filtered)
# MDS plot by "cell_type" in samples
plotMDS(heatmap_matrix, labels=rownames(samples_filtered),
col = c("darkgreen","blue")[factor(samples_filtered$cell_type)],
main = "MDS plot depending on cell type")
pat_colors <- rainbow(12)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
# MDS plot by "cell_type" + "patients"in samples
plotMDS(heatmap_matrix, col = pat_colors,
main = "MDS plot depending on both cell type and patients")
Based on the two models in part1, I decide to base my model only on “cell_type”
model_design <- model.matrix(~ samples$cell_type)
kable(model_design, type="html")
There are 6988 genes that pass the p-value = 0.05 which is chosen based on the paper.
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d <- estimateDisp(d, model_design_pat)
I have shown that my data is suitable for using edgeR for further analysis. The data follows the binomial distribution.
plotMeanVar(d, show.raw.vars = TRUE,
show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline=TRUE,
show.binned.common.disp.vars = TRUE,
main = "Binomial distribution of my data")
The individual dots represent each gene and the blue line is the overall trend line.
plotBCV(d,col.tagwise = "black",col.common = "red",
main = "BCV plot of RNA-seq data")
I used Quasi-likelihood models to fit my data and used QLFTest to test for differential expression. The Quasi-likelihood compares two conditions (primary tumour and tumoursphere) and shows the up and down-regulated genes. The result below that are sorted by p-value. I also inspected the number of genes that satisty my threshold and correction. I choose to use FDR correction based on the paper as well(Goh et al. 2017) . There are 7467 genes pass the p-value = 0.05, and 5033 genes that pass the FDR correction.
I determined the number of up-regulated genes by selecting every gene that does not pass my p-value: 0.05, and also have a positive log fold change. Down-regulated genes are selected in the same way with a negative log fold change. Stored these data for later enrichment analysis on gProfileR.
I have shown the up and down-regulated genes in a volcano plot by coloring them in red and blue, the code is from (Annick Moisan, n.d.)
volcanoData <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(volcanoData) <- c("logFC", "Pval")
up <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 0
point.col <- ifelse(up, "red", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Up-regulated genes in RNA-seq data")
down <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(down, "blue", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Down-regulated genes in RNA-seq data")
To test the differential expression, I used the heatmap and it has shown a clear distinction between up and down regulated genes. There is a clear difference between the primary tumour samples and tumoursphere samples.(They are reversed.) The clustering is very obvious to show that differential expression exists.
top_hits <- rownames(qlf_output_hits$table)[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits <- heatmap_matrix_tophits[, c(grep(colnames(heatmap_matrix_tophits),pattern = "_P"),
grep(colnames(heatmap_matrix_tophits),pattern = "_S"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,)
current_heatmap
There were no REAC found. I used the gprofiler2’s function to query data and also attached the screenshots that I took on their website since the package does not show the number of gene sets each has found.
The some analysis is apply to down regulated
Annick Moisan, Nathalie Villa-Vialaneix, Ignacio Gonzales. n.d. “Practical Statistical Analysis of Rna-Seq Data - edgeR - Tomato Data.”
Goh, Jian Yuan, Min Feng, Wenyu Wang, Gokce Oguz, Siti Maryam JM Yatim, Puay Leng Lee, Yi Bao, et al. 2017. “Chromosome 1q21.3 Amplification Is a Trackable Biomarker and Actionable Target for Breast Cancer Recurrence.” Nature Medicine 23 (11). Nature Publishing Group: 1319.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–2.
Kolberg, Liis, and Uku Raudvere. 2019. Gprofiler2: Interface to the ’G:Profiler’ Toolset. https://CRAN.R-project.org/package=gprofiler2.
Rainer, Johannes. 2017. EnsDb.Hsapiens.v75: Ensembl Based Annotation Package.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Vasiliou, Vasilis, and Daniel W Nebert. 2005. “Analysis and Update of the Human Aldehyde Dehydrogenase (Aldh) Gene Family.” Human Genomics 2 (2). BioMed Central: 138.
Vassalli, Giuseppe. 2019. “Aldehyde Dehydrogenases: Not Just Markers, but Functional Regulators of Stem Cells.” Stem Cells International 2019. Hindawi.
Xie, Yihui. 2020. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.